When Amdahl Meets Young/Daly

Author: Cavelan Aurélien
Li Jiafan
Robert Yves
Sun Hongyang
Publication venue: IEEE Computer Society
Publication date: 13/09/2016

This paper investigates the optimal number of processors to execute a parallel job, whose speedup profile obeys Amdahl's law, on a large-scale platform subject to fail-stop and silent errors. We combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both error sources. We provide an exact formula to express the execution overhead incurred by a periodic checkpointing pattern of length T and with P processors, and we give first-order approximations for the optimal values T* and P* as a function of the individual processor failure rate λ_ind. A striking result is that P* is of the order λ_ind^(-1/4) if the checkpointing cost grows linearly with the number of processors, and of the order λ_ind^(-1/3) if the checkpointing cost stays bounded for any P. We conduct an extensive set of simulations to support the theoretical study. The results confirm the accuracy of the first-order approximation under a wide range of parameter settings.
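The two scaling regimes stated in this abstract can be checked numerically on a much simpler model than the paper's. The sketch below is an illustration under stated assumptions, not the authors' exact formula: it ignores silent errors and verifications, uses the classic Young/Daly first-order overhead for fail-stop errors (period sqrt(2C/λ), overhead sqrt(2Cλ)), takes the platform failure rate as P times the individual rate, and models Amdahl's law with a sequential fraction alpha. The function names and all parameter values (alpha = 0.1, a 60-second bounded checkpoint, one second per processor for the linear case) are hypothetical.

```python
import math

def expected_overhead(P, lam_ind, alpha, ckpt_cost):
    """First-order expected time per unit of sequential work on P processors.

    Assumed simplified model (fail-stop errors only):
      - platform failure rate is P * lam_ind,
      - with the Young/Daly period sqrt(2C/lam), the relative overhead is sqrt(2*C*lam),
      - Amdahl's law gives an execution time of alpha + (1 - alpha) / P.
    """
    C = ckpt_cost(P)
    lam = P * lam_ind
    amdahl = alpha + (1.0 - alpha) / P
    return amdahl * (1.0 + math.sqrt(2.0 * C * lam))

def best_processor_count(lam_ind, alpha, ckpt_cost, lo=1.0, hi=1e9):
    """Minimize the overhead over P by ternary search on log(P).

    The overhead is a sum of positive power functions of P, hence convex
    in log(P), so ternary search converges to the minimum.
    """
    a, b = math.log(lo), math.log(hi)
    for _ in range(200):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        f1 = expected_overhead(math.exp(m1), lam_ind, alpha, ckpt_cost)
        f2 = expected_overhead(math.exp(m2), lam_ind, alpha, ckpt_cost)
        if f1 < f2:
            b = m2
        else:
            a = m1
    return math.exp(0.5 * (a + b))

def scaling_exponent(alpha, ckpt_cost, lam_hi=1e-8, lam_lo=1e-10):
    """Estimate e such that P* behaves like lam_ind**e, from two failure rates."""
    p_hi = best_processor_count(lam_hi, alpha, ckpt_cost)
    p_lo = best_processor_count(lam_lo, alpha, ckpt_cost)
    return math.log(p_hi / p_lo) / math.log(lam_hi / lam_lo)

# Hypothetical settings: 10% sequential fraction, two checkpoint cost models.
bounded = scaling_exponent(0.1, lambda P: 60.0)     # cost independent of P
linear = scaling_exponent(0.1, lambda P: 1.0 * P)   # cost grows linearly with P
print(f"bounded checkpoint cost: exponent ~ {bounded:.3f}")
print(f"linear checkpoint cost:  exponent ~ {linear:.3f}")
```

Even in this reduced setting, the estimated exponents come out close to -1/3 for the bounded cost and -1/4 for the linear cost, which matches the orders of magnitude quoted in the abstract.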

Optimal resilience patterns to cope with fail-stop and silent errors

Author: Benoit Anne
Cavelan Aurélien
Robert Yves
Sun Hongyang
Publication venue: HAL CCSD
Publication date: 01/10/2015

This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors. Many others deal with silent errors (or silent data corruptions). But very few papers deal with fail-stop and silent errors simultaneously. However, HPC applications will obviously have to cope with both error sources. This paper presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are processed via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model, and a full characterization of the optimal pattern. Our results nicely extend several published solutions and demonstrate how to make use of different techniques to solve the double threat of fail-stop and silent errors. Extensive simulations based on real data confirm the accuracy of the model, and show that patterns that combine all resilience mechanisms are required to provide acceptable overheads.

Scheduling Independent Tasks with Voltage Overscaling

Author: Cavelan Aurélien
Robert Yves
Sun Hongyang
Vivien Frédéric
Publication venue: HAL CCSD
Publication date: 18/11/2015

In this paper, we discuss several scheduling algorithms to execute independent tasks with voltage overscaling. Given a frequency to execute the tasks, operating at a voltage below threshold leads to significant energy savings but also induces timing errors. A verification mechanism must be enforced to detect these errors. Contrary to fail-stop or silent errors, timing errors are deterministic (but unpredictable). For each task, the general strategy is to select a voltage for execution, to check the result, and to select a higher voltage for re-execution if a timing error has occurred, and so on until a correct result is obtained. Switching from one voltage to another incurs a given cost, so it might be efficient to try to execute several tasks at the current voltage before switching to another one. Determining the optimal solution turns out to be unexpectedly difficult. However, we provide the optimal algorithm for a single task, the optimal algorithm when there are only two voltages, and the optimal level algorithm for a set of independent tasks, where a level algorithm is defined as an algorithm that executes all remaining tasks when switching to a given voltage. Furthermore, we show that the optimal level algorithm is in fact globally optimal (among all possible algorithms) when voltage switching costs are linear. Finally, we report a comprehensive set of simulations to assess the potential gain of voltage overscaling algorithms.
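The per-task strategy described in this abstract (execute at a voltage, verify, re-execute at a higher voltage on a timing error) can be sketched as a short loop. This is a minimal illustration, not one of the paper's algorithms: the function name, the energy and switching-cost models, and the assumption that each task has a hidden threshold voltage above which it always succeeds are all hypothetical. The threshold is only a convenient way to model errors that are deterministic but unpredictable.

```python
def run_task(plan, min_ok_voltage, exec_energy, verify_energy, switch_cost):
    """Execute one task by trying the voltages in `plan` in the given order.

    Assumptions (hypothetical model, not taken from the paper):
      - timing errors are deterministic: the task fails at every voltage
        below `min_ok_voltage` and succeeds at every voltage at or above it,
      - the scheduler does not know `min_ok_voltage`; it only sees the
        outcome of the verification after each attempt,
      - every attempt pays an execution cost and a verification cost, and
        every change of voltage pays a switching cost.
    Returns the voltage of the successful attempt and the total energy spent.
    """
    total = 0.0
    current = None
    for v in plan:
        if current is not None:
            total += switch_cost(current, v)      # pay to change voltage
        current = v
        total += exec_energy(v) + verify_energy(v)
        if v >= min_ok_voltage:                   # verification passes
            return v, total
    raise RuntimeError("no voltage in the plan executes the task correctly")

# Hypothetical cost models: energy quadratic in voltage, linear switching cost.
voltage, energy = run_task(
    plan=[0.6, 0.8, 1.0],
    min_ok_voltage=0.75,
    exec_energy=lambda v: v * v,
    verify_energy=lambda v: 0.1 * v * v,
    switch_cost=lambda a, b: 0.05 * abs(b - a),
)
print(voltage, energy)
```

In this example the attempt at 0.6 fails verification and the task completes at 0.8, so the energy of the failed attempt and one voltage switch are added to the cost of the successful run. This is the trade-off the scheduling algorithms have to balance.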

Two-Level Checkpointing and Verifications for Linear Task Graphs

Author: Benoit Anne
Cavelan Aurélien
Robert Yves
Sun Hongyang
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 15/11/2015

Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience techniques must accommodate both error sources. To cope with the double challenge, a two-level checkpointing and rollback recovery approach can be used, with additional verifications for silent error detection. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint on a stable storage (e.g., an external disk). On the contrary, it is possible to use in-memory checkpoints for silent errors, which provide a much smaller checkpointing and recovery overhead. Furthermore, recent detectors offer partial verification mechanisms that are less costly than the guaranteed ones but do not detect all silent errors. In this paper, we show how to combine all of these techniques for HPC applications whose dependency graph forms a linear chain. We present a sophisticated dynamic programming algorithm that returns the optimal solution in polynomial time. Simulation results demonstrate that the combined use of multi-level checkpointing and verifications leads to improved performance compared to the standard single-level checkpointing algorithm.

Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Author: Benoit Anne
Cavelan Aurélien
Robert Yves
Sun Hongyang
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2016

In this paper, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a popular approach for reducing the energy consumption, using lower speeds/voltages can increase the number of errors, thereby complicating the problem. We consider an application workflow whose dependence graph is a chain of tasks, and we study three execution scenarios: (i) a single speed is used during the whole execution; (ii) a second, possibly higher speed is used for any potential re-execution; (iii) different pairs of speeds can be used throughout the execution. For each scenario, we determine the optimal checkpointing and verification locations (and the optimal speeds for the third scenario) to minimize either objective. The different execution scenarios are then assessed and compared through an extensive set of experiments.